Enrich DurabilityAgent.CheckHealthAsync with persistence-layer signals (#2646) #2649
Merged
jeremydmiller merged 1 commit into main on May 1, 2026
…tence signals (#2646)

The three durability agents (Wolverine.RDBMS, Wolverine.RavenDb, Wolverine.CosmosDb) all relied on the default IAgent.CheckHealthAsync: Status==Running ? Healthy : Unhealthy. That hides the cases monitoring tools (CritterWatch's Agents tab) actually need to flag: a healthy-looking agent silently failing to reach the store, the DLQ ballooning because handlers are dying, or a recovery loop that's not draining a stuck batch.

This commit threads three new persistence signals through every durability agent:

1. **Persistence reachability**: each agent's poll loop now wraps its tick in a try/catch and feeds the outcome into a per-agent `DurabilityHealthSignals` instance. CheckHealthAsync also pings the store via FetchCountsAsync. One failed cycle ⇒ Degraded with the underlying error message; N consecutive failures (default 3, `DurabilitySettings.HealthConsecutiveFailureUnhealthyThreshold`) ⇒ Unhealthy.
2. **Dead-letter queue growth**: between consecutive evaluations, compare the `PersistedCounts.DeadLetter` delta against `DurabilitySettings.HealthDeadLetterGrowthPerMinuteThreshold` (default 100/min). Above threshold ⇒ Degraded with the rate in the description.
3. **Stuck recovery / scheduled-job pollers**: if the persisted inbox+outbox total (or scheduled count) stays non-zero and never decreases across `DurabilitySettings.HealthStuckPollCycleThreshold` consecutive evaluations (default 3) ⇒ Degraded. Catches the "single bad envelope blocks the queue" case the issue calls out.

Status precedence: a non-Running status always returns Unhealthy first; then the consecutive-failure Unhealthy; then the worst aggregated Degraded. Multiple Degraded signals are joined into a single `;`-separated description so operators see the full picture in one tooltip.

`DurabilityHealthSignals` is intentionally public so per-store agents from the RavenDb / CosmosDb assemblies (which do not have InternalsVisibleTo into Wolverine) can use it directly. The class is deliberately small: shared mutable state, RecordPollSuccess/Failure mutators, and a single Evaluate() that takes the current PersistedCounts snapshot.

Test plan:
- New CoreTests/Persistence/durability_health_signals_tests covers the helper in isolation: status precedence, single-failure Degraded, threshold-based Unhealthy, DLQ growth above + below threshold, stuck-recovery + stuck-scheduled with reset behaviour, multi-signal aggregation, and the diagnostic counter accessor. 12/12 green.
- Full CoreTests suite green: Failed: 0, Passed: 1421, Total: 1421.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This was referenced May 2, 2026
Closes #2646.
Summary
The three durability agents (`Wolverine.RDBMS.DurabilityAgent`, `RavenDbDurabilityAgent`, `CosmosDbDurabilityAgent`) all relied on the default `IAgent.CheckHealthAsync`: `Status == Running ? Healthy : Unhealthy`. That hides the cases monitoring tools (CritterWatch's Agents tab) actually need to flag: a healthy-looking agent silently failing to reach the store, the dead-letter queue ballooning because handlers are dying, or a recovery loop that's not draining a stuck batch.

Threads three new persistence signals through every durability agent:

- Persistence reachability: each agent's poll loop wraps its tick in `try/catch` and feeds the outcome into a per-agent `DurabilityHealthSignals` instance. `CheckHealthAsync` also pings the store via `FetchCountsAsync`. One failed cycle ⇒ Degraded with the underlying error message; N consecutive failures (default 3, `DurabilitySettings.HealthConsecutiveFailureUnhealthyThreshold`) ⇒ Unhealthy.
- Dead-letter queue growth: between consecutive evaluations, compare the `PersistedCounts.DeadLetter` delta against `DurabilitySettings.HealthDeadLetterGrowthPerMinuteThreshold` (default 100/min). Above threshold ⇒ Degraded with the rate in the description.
- Stuck recovery / scheduled-job pollers: if persisted inbox+outbox (or scheduled) counts stay non-zero and never decrease across `DurabilitySettings.HealthStuckPollCycleThreshold` consecutive evaluations (default 3) ⇒ Degraded. Catches the "single bad envelope blocks the queue" case the issue calls out.

Status precedence: a non-Running status always returns Unhealthy first; then the consecutive-failure Unhealthy; then the worst aggregated Degraded. Multiple Degraded signals are joined into a single `;`-separated description so operators see the full picture in one tooltip.

`DurabilityHealthSignals` is intentionally `public` so per-store agents from the RavenDb / CosmosDb assemblies (which do not have `InternalsVisibleTo` into Wolverine) can use it directly. The class is deliberately small: shared mutable state, `RecordPollSuccess` / `RecordPollFailure` mutators, and a single `Evaluate()` that takes the current `PersistedCounts` snapshot.

Files

- `src/Wolverine/Persistence/Durability/DurabilityHealthSignals.cs` (new): the shared evaluator.
- `src/Wolverine/DurabilitySettings.cs`: three new threshold properties (defaults: 100/min DLQ growth, 3 stuck cycles, 3 consecutive failures).
- `src/Persistence/Wolverine.RDBMS/DurabilityAgent.cs`: replaces the existing `_successCount` / `_exceptionCount` rolling logic with the shared signals; adds count-based signals.
- `src/Persistence/Wolverine.RavenDb/Internals/Durability/RavenDbDurabilityAgent.cs`: adds `CheckHealthAsync` override; wraps each recovery + scheduled-job tick in try/catch to feed the signals.
- `src/Persistence/Wolverine.CosmosDb/Internals/Durability/CosmosDbDurabilityAgent.cs`: same shape as RavenDb.

Test plan

- `CoreTests/Persistence/durability_health_signals_tests` covers the helper in isolation: status precedence, single-failure Degraded, threshold-based Unhealthy, DLQ growth above + below threshold, stuck-recovery + stuck-scheduled with reset behaviour, multi-signal aggregation, and the diagnostic counter accessor. 12/12 green.
- Failed: 0, Passed: 1421, Total: 1421, Duration: 3m 53s.

🤖 Generated with Claude Code
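For readers who want the evaluation rules from the Summary in one place, here is a minimal, self-contained sketch of how the three signals and the status precedence could fit together. It is not the code from this PR: `HealthSignalsSketch`, `Counts`, `Health`, and `HealthResult` are hypothetical stand-ins, the thresholds are passed as constructor arguments rather than read from `DurabilitySettings`, the counts are handed in rather than fetched from the store, and the exact reset behaviour of the stuck counters is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Usage: one failed poll with the default thresholds yields Degraded, not Unhealthy.
var signals = new HealthSignalsSketch();
signals.RecordPollFailure(new TimeoutException("store unreachable"));
var result = signals.Evaluate(isRunning: true, new Counts(5, 0, 0, 10), DateTimeOffset.UnixEpoch);
Console.WriteLine($"{result.Status}: {result.Description}");
// prints "Degraded: Last poll failed: store unreachable"

public enum Health { Healthy, Degraded, Unhealthy }

// Hypothetical stand-in for a persisted-counts snapshot.
public record Counts(int Incoming, int Outgoing, int Scheduled, int DeadLetter);

public record HealthResult(Health Status, string Description);

// Hypothetical sketch of a shared evaluator; names and details are assumptions.
public class HealthSignalsSketch
{
    private readonly int _failureThreshold;
    private readonly double _dlqPerMinuteThreshold;
    private readonly int _stuckThreshold;
    private int _consecutiveFailures;
    private string _lastError = "";
    private Counts? _previous;
    private DateTimeOffset _previousAt;
    private int _stuckRecovery;
    private int _stuckScheduled;

    public HealthSignalsSketch(int failureThreshold = 3, double dlqPerMinuteThreshold = 100, int stuckThreshold = 3)
    {
        _failureThreshold = failureThreshold;
        _dlqPerMinuteThreshold = dlqPerMinuteThreshold;
        _stuckThreshold = stuckThreshold;
    }

    public void RecordPollSuccess() => _consecutiveFailures = 0;

    public void RecordPollFailure(Exception e)
    {
        _consecutiveFailures++;
        _lastError = e.Message;
    }

    public HealthResult Evaluate(bool isRunning, Counts current, DateTimeOffset now)
    {
        // Precedence 1: a non-running agent is always Unhealthy.
        if (!isRunning) return new HealthResult(Health.Unhealthy, "Agent is not running");

        // Precedence 2: repeated persistence failures escalate to Unhealthy.
        if (_consecutiveFailures >= _failureThreshold)
            return new HealthResult(Health.Unhealthy, $"{_consecutiveFailures} consecutive poll failures: {_lastError}");

        // Precedence 3: collect every Degraded signal so none is hidden.
        var problems = new List<string>();
        if (_consecutiveFailures > 0) problems.Add($"Last poll failed: {_lastError}");

        if (_previous is not null)
        {
            var minutes = (now - _previousAt).TotalMinutes;
            var rate = minutes > 0 ? (current.DeadLetter - _previous.DeadLetter) / minutes : 0;
            if (rate > _dlqPerMinuteThreshold) problems.Add($"Dead-letter queue growing at {rate:F0}/min");
        }

        _stuckRecovery = NextStuck(_stuckRecovery, _previous?.Incoming + _previous?.Outgoing, current.Incoming + current.Outgoing);
        _stuckScheduled = NextStuck(_stuckScheduled, _previous?.Scheduled, current.Scheduled);
        if (_stuckRecovery >= _stuckThreshold) problems.Add("Recovery backlog is not draining");
        if (_stuckScheduled >= _stuckThreshold) problems.Add("Scheduled backlog is not draining");

        _previous = current;
        _previousAt = now;

        return problems.Count == 0
            ? new HealthResult(Health.Healthy, "OK")
            : new HealthResult(Health.Degraded, string.Join("; ", problems));
    }

    // Counts consecutive evaluations where a backlog is non-zero and has not shrunk.
    private static int NextStuck(int cycles, int? previous, int current)
    {
        if (current == 0) return 0;
        if (previous is null) return 1;
        return current >= previous ? cycles + 1 : 0;
    }
}
```

Passing `now` into `Evaluate` instead of reading the clock inside keeps the dead-letter growth rate deterministic in tests. This is a choice made for the sketch and says nothing about how the real class obtains its timestamps.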